Libraries:
The Red Wine Quality data set contains 1,599 red wines with 11 variables on the chemical properties of the wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).
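Input variables (based on physicochemical tests): fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates and alcohol.
Output variable (based on sensory data): quality (score between 0 and 10).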
For this analysis, we are mainly looking to answer one question: Which chemical properties influence the quality of red wines?
To understand the Red Wine Quality dataset a little better, the first step is to look at the summary of the variables it contains. With this summary it’s possible to check how spread out the values are, by checking the min and max values. It’s also possible to get a quick idea of the distribution of the data, by comparing the mean and the median values.
## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density" "pH"
## [10] "sulphates" "alcohol" "quality"
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
Another interesting way to check the data is to visualize the distribution of each feature. This can be achieved with histograms, as shown below:
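As a rough illustration of how such histograms could be produced, here is a minimal base R sketch. The data frame `wines_demo` and its simulated values are hypothetical stand-ins, not the real wine data, and the plotting approach is an assumption rather than the code used for this report.

```r
# Hypothetical stand-in for the wine data: two simulated features.
set.seed(42)
wines_demo <- data.frame(
  alcohol = rnorm(200, mean = 10.4, sd = 1.0),
  total.sulfur.dioxide = rlnorm(200, meanlog = 3.6, sdlog = 0.7)
)

# Draw to a null PDF device so the sketch also runs non-interactively.
pdf(NULL)
op <- par(mfrow = c(1, ncol(wines_demo)))

# One histogram per column; hist() also returns the bin counts it computed.
hists <- lapply(names(wines_demo), function(col) {
  hist(wines_demo[[col]], breaks = 20, main = col, xlab = col)
})
names(hists) <- names(wines_demo)

par(op)
invisible(dev.off())
```

In an interactive session the `pdf(NULL)` and `dev.off()` lines can be dropped so the plots appear on screen.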
As mentioned above, analysing the data distribution can be really helpful to get a quick overview of our data and its boundaries.
We can classify a distribution as one of the following:
Normal distribution: The Gaussian distribution (also known as the normal distribution) is a bell-shaped curve. It is often assumed that measured values follow a normal distribution, with an equal number of measurements above and below the mean value. (Source:https://bit.ly/2S9dYiD)
Right skewed or positive skewed distribution: In such a distribution, the mean is greater than the median, which in turn is greater than the mode (i.e., mean > median > mode); in which case the skewness is greater than zero. (Source: https://en.wikiversity.org/wiki/Skewness)
Left skewed or negative skewed distribution: In such a distribution, the mean is lower than the median, which in turn is lower than the mode (i.e., mean < median < mode); in which case the skewness is lower than zero. (Source: https://en.wikiversity.org/wiki/Skewness)
Based on the definitions above and on the summary (mean and median), we can classify the wine properties into one of the distributions mentioned above:
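One way to make this mean-versus-median comparison concrete is sketched below, using a few rows of the summary output from this report. The function `classify_shape`, the idea of scaling the gap by the interquartile range, and the 0.1 tolerance are all assumptions made for illustration. Note also that a mean close to the median only suggests symmetry; it does not prove the distribution is normal.

```r
# Summary values copied from the summary() output in this report.
stats <- data.frame(
  feature = c("total.sulfur.dioxide", "alcohol", "density", "pH"),
  q1      = c(22.00,  9.50, 0.9956, 3.210),
  median  = c(38.00, 10.20, 0.9968, 3.310),
  mean    = c(46.47, 10.42, 0.9967, 3.311),
  q3      = c(62.00, 11.10, 0.9978, 3.400)
)

# Assumed rule of thumb: express (mean - median) as a fraction of the IQR
# and call the shape skewed when that fraction exceeds the tolerance.
classify_shape <- function(mean, median, q1, q3, tol = 0.1) {
  gap <- (mean - median) / (q3 - q1)
  ifelse(gap > tol, "right skewed",
         ifelse(gap < -tol, "left skewed", "roughly symmetric"))
}

stats$shape <- with(stats, classify_shape(mean, median, q1, q3))
print(stats[, c("feature", "shape")])
```

With these values, total.sulfur.dioxide and alcohol come out as right skewed, while density and pH come out as roughly symmetric.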
One interesting transformation that helps to “normalize” a distribution is applying the log scale to the data:
If we look more closely at a couple of variables, like Total SO2 and density, we can see how the log scale helps to normalize the variables:
As we can see, for Total SO2, which has a right skewed distribution, applying the log function gives the shape of a normal distribution. On the other hand, density, which already has a normal distribution, doesn’t lose its shape, and we get a smoother distribution with fewer bumps.
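The effect of the log transformation on a right skewed variable can be checked numerically. The sketch below uses simulated log-normal values as a stand-in for Total SO2 (not the real column) and a hand-written `skewness` helper. Using `log10` is an assumption about which log scale was applied.

```r
set.seed(1)
# Simulated right-skewed values (log-normal); not the real Total SO2 column.
x <- rlnorm(1000, meanlog = 3.6, sdlog = 0.7)

# Sample skewness: third central moment divided by the
# second central moment raised to the power 1.5.
skewness <- function(v) {
  m <- mean(v)
  mean((v - m)^3) / (mean((v - m)^2))^1.5
}

skew_raw <- skewness(x)          # clearly positive for right-skewed data
skew_log <- skewness(log10(x))   # expected to be much closer to zero

cat("skewness before:", round(skew_raw, 2),
    " after log10:", round(skew_log, 2), "\n")
```

A skewness close to zero after the transformation is consistent with the more symmetric, bell-like shape described above.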
To get a quick look at the correlation between pairs of features, it’s possible to plot a matrix of plots and values. This plot is shown below:
Since the matrix plot is a little hard to read and the correlation numbers are cut off, I decided to generate a scatter plot of the Wine Quality against each feature, with the mean and median also in the plot. I also calculated the correlation between Wine Quality and each of the features.
Another comparison that I decided to make was a boxplot of each feature for each Wine Quality score. To do this plot, I needed to add a new categorical variable named grade_number to our dataset. This helped me to see the variation of the data for each Wine Quality score.
The plots and calculations are shown below:
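A minimal sketch of the categorical variable and the per-quality boxplot follows. It assumes that grade_number is simply quality converted to a factor, which the text does not state explicitly, and it uses a hypothetical simulated data frame `wines_demo` in place of the real data.

```r
set.seed(7)
# Hypothetical stand-in for the wine data.
wines_demo <- data.frame(
  quality = sample(3:8, 300, replace = TRUE),
  alcohol = round(runif(300, 8.4, 14.9), 1)
)

# Assumption: grade_number is quality stored as a factor,
# so that each score gets its own box.
wines_demo$grade_number <- factor(wines_demo$quality)

# Five-number summary per grade (min, lower hinge, median, upper hinge, max).
# A boxplot draws the hinges and median; its whiskers are trimmed at 1.5 IQR.
per_grade <- tapply(wines_demo$alcohol, wines_demo$grade_number, fivenum)

pdf(NULL)  # null device so the sketch runs non-interactively
boxplot(alcohol ~ grade_number, data = wines_demo,
        xlab = "quality", ylab = "alcohol")
invisible(dev.off())
```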
In this section, we analyse the variable Fixed Acidity against the wine’s quality. First we calculate the correlation between the two variables, followed by a scatter plot with smoothed means and a box plot to analyse the variation in the data.
##
## Pearson's product-moment correlation
##
## data: wines$quality and wines$fixed.acidity
## t = 4.996, df = 1597, p-value = 6.496e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.07548957 0.17202667
## sample estimates:
## cor
## 0.1240516
We got a small positive correlation between the variables. The scatter plot and the boxplot show that the data is widely spread, as expected from the low correlation value.
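The output format above matches R's `cor.test()`, and the reported t statistic can be reproduced by hand from the correlation and the degrees of freedom using t = r * sqrt(n - 2) / sqrt(1 - r^2). The sketch below only reuses the numbers printed above; the variable names are chosen for illustration.

```r
# Values copied from the correlation output above.
r  <- 0.1240516   # sample correlation between quality and fixed.acidity
df <- 1597        # degrees of freedom: n - 2, with n = 1599 wines

# t statistic for the null hypothesis that the true correlation is zero.
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution with df degrees of freedom.
p_value <- 2 * pt(-abs(t_stat), df)

cat("t =", round(t_stat, 3), " p =", signif(p_value, 4), "\n")
```

This also shows why a small correlation can still have a tiny p-value: with 1,599 observations even r = 0.124 is clearly distinguishable from zero, although r^2 is only about 0.015.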
In this section, we analyse the variable Volatile Acidity against the wine’s quality. First we calculate the correlation between the two variables, followed by a scatter plot with smoothed means and a box plot to analyse the variation in the data.
##
## Pearson's product-moment correlation
##
## data: wines$quality and wines$volatile.acidity
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4313210 -0.3482032
## sample estimates:
## cor
## -0.3905578
We got a weak to medium negative correlation between the variables. The scatter plot and the boxplot show that the data is less spread, as expected from the larger correlation value.
In this section, we analyse the variable Citric Acid against the wine’s quality. First we calculate the correlation between the two variables, followed by a scatter plot with smoothed means and a box plot to analyse the variation in the data.
##
## Pearson's product-moment correlation
##
## data: wines$quality and wines$citric.acid
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1793415 0.2723711
## sample estimates:
## cor
## 0.2263725
We got a small to medium positive correlation between the variables. The boxplot shows that the data is considerably spread.
In this section, we analyse the variable Residual Sugar against the wine’s quality. First we calculate the correlation between the two variables, followed by a scatter plot with smoothed means and a box plot to analyse the variation in the data.
##
## Pearson's product-moment correlation
##
## data: wines$quality and wines$residual.sugar
## t = 0.5488, df = 1597, p-value = 0.5832
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.03531327 0.06271056
## sample estimates:
## cor
## 0.01373164
We got a very small correlation between the variables, and with a p-value of 0.5832 it is not statistically distinguishable from zero. The scatter plot and box plot show little variation between these two variables, even though there are several outliers in the data.
In this section, we analyse the variable Chlorides against the wine’s quality. First we calculate the correlation between the two variables, followed by a scatter plot with smoothed means and a box plot to analyse the variation in the data.
##
## Pearson's product-moment correlation
##
## data: wines$quality and wines$chlorides
## t = -5.1948, df = 1597, p-value = 2.313e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.17681041 -0.08039344
## sample estimates:
## cor
## -0.1289066
We got a small negative correlation between the variables. The scatter plot and box plot show little variation between these two variables, even though there are several outliers in the data.
In this section, we analyse the variable Free Sulfur Dioxide against the wine’s quality. First we calculate the correlation between the two variables, followed by a scatter plot with smoothed means and a box plot to analyse the variation in the data.
##
## Pearson's product-moment correlation
##
## data: wines$quality and wines$free.sulfur.dioxide
## t = -2.0269, df = 1597, p-value = 0.04283
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.099430290 -0.001638987
## sample estimates:
## cor
## -0.05065606
We got a very small correlation between the variables. The scatter plot and box plot show a good amount of variation in the data for these two variables.
In this section, we analyse the variable Total Sulfur Dioxide against the wine’s quality. First we calculate the correlation between the two variables, followed by a scatter plot with smoothed means and a box plot to analyse the variation in the data.
##
## Pearson's product-moment correlation
##
## data: wines$quality and wines$total.sulfur.dioxide
## t = -7.5271, df = 1597, p-value = 8.622e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2320162 -0.1373252
## sample estimates:
## cor
## -0.1851003
We got a small correlation between the variables.
In this section, we analyse the variable Density against the wine’s quality. First we calculate the correlation between the two variables, followed by a scatter plot with smoothed means and a box plot to analyse the variation in the data.
##
## Pearson's product-moment correlation
##
## data: wines$quality and wines$density
## t = -7.0997, df = 1597, p-value = 1.875e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2220365 -0.1269870
## sample estimates:
## cor
## -0.1749192
We got a small negative correlation between the variables. The scatter plot and box plot show a few negative and positive outliers in the data.
In this section, we analyse the variable pH against the wine’s quality. First we calculate the correlation between the two variables, followed by a scatter plot with smoothed means and a box plot to analyse the variation in the data.
##
## Pearson's product-moment correlation
##
## data: wines$quality and wines$pH
## t = -2.3109, df = 1597, p-value = 0.02096
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.106451268 -0.008734972
## sample estimates:
## cor
## -0.05773139
We got a very small correlation between the variables.
In this section, we analyse the variable Sulphates against the wine’s quality. First we calculate the correlation between the two variables, followed by a scatter plot with smoothed means and a box plot to analyse the variation in the data.
##
## Pearson's product-moment correlation
##
## data: wines$quality and wines$sulphates
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2049011 0.2967610
## sample estimates:
## cor
## 0.2513971
We got a small positive correlation between the variables, which is also visible in the scatter plot and box plot.
In this section, we analyse the variable Alcohol against the wine’s quality. First we calculate the correlation between the two variables, followed by a scatter plot with smoothed means and a box plot to analyse the variation in the data.
##
## Pearson's product-moment correlation
##
## data: wines$quality and wines$alcohol
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4373540 0.5132081
## sample estimates:
## cor
## 0.4761663
We got a medium correlation between the variables. The scatter plot and box plot show a medium positive correlation between these two variables. This is the strongest correlation between a feature and the wine’s quality that we got in the dataset.
The results obtained from the correlation analysis, between each feature and the Quality, were:
| Feature | Correlation | Orientation | Strength |
|---|---|---|---|
| fixed.acidity | 0.1241 | Positive | Very Weak |
| volatile.acidity | -0.3906 | Negative | Weak |
| citric.acid | 0.2264 | Positive | Weak |
| residual.sugar | 0.0137 | Positive | Very Weak |
| chlorides | -0.1289 | Negative | Very Weak |
| free.sulfur.dioxide | -0.0507 | Negative | Very Weak |
| total.sulfur.dioxide | -0.1851 | Negative | Very Weak |
| density | -0.1749 | Negative | Very Weak |
| pH | -0.0577 | Negative | Very Weak |
| sulphates | 0.2514 | Positive | Weak |
| alcohol | 0.4762 | Positive | Medium |
Table Interpretation:
- Strength: Very Weak (0 ~ 0.2), Weak (0.21 ~ 0.4), Medium (0.41 ~ 0.6), Strong (0.61 ~ 0.8), Very Strong (0.81 ~ 1.0);
- Orientation: Positive, Negative.
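The strength and orientation labels can be derived mechanically from the correlation values. The sketch below applies the bands from the table interpretation with base R's `cut()`. The values in `cors` are rounded from the correlation outputs in this report, and the variable names are chosen for illustration.

```r
# Correlations with quality, rounded from the outputs in this report.
cors <- c(fixed.acidity = 0.1241, volatile.acidity = -0.3906,
          citric.acid = 0.2264, residual.sugar = 0.0137,
          chlorides = -0.1289, free.sulfur.dioxide = -0.0507,
          total.sulfur.dioxide = -0.1851, density = -0.1749,
          pH = -0.0577, sulphates = 0.2514, alcohol = 0.4762)

# Strength bands on the absolute value: each band is closed on the right.
strength <- cut(abs(cors),
                breaks = c(0, 0.2, 0.4, 0.6, 0.8, 1),
                labels = c("Very Weak", "Weak", "Medium",
                           "Strong", "Very Strong"),
                include.lowest = TRUE)

# Orientation is just the sign of the correlation.
orientation <- ifelse(cors >= 0, "Positive", "Negative")

result <- data.frame(feature = names(cors), correlation = unname(cors),
                     orientation = unname(orientation), strength = strength)

# Sort by absolute correlation, strongest first.
print(result[order(-abs(result$correlation)), ], row.names = FALSE)
```

Sorting by absolute value puts alcohol, volatile.acidity, sulphates and citric.acid at the top.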
From the results we can highlight the correlations between Quality and Alcohol, Quality and Volatile Acidity, Quality and Sulphates, and Quality and Citric Acid. The other correlations have really small values, which indicates that those features don’t have a big impact on the Wine Quality result.
In this section, it’s necessary to generate more complex plots, by adding color to the points. This adds a new layer and opens the path to the analysis of three variables, instead of only two as was done in the previous plots. For this analysis, we’ll use the variables with the largest correlations (in absolute value) with the Wine Quality, which from the table above are alcohol, volatile.acidity, sulphates and citric.acid.
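A three-variable plot of this kind can be sketched in base R by mapping each quality score to a color. The data frame `wines_demo` is a hypothetical simulated stand-in, and the blue-to-red palette is an arbitrary choice, not necessarily the one used for the plots in this report.

```r
set.seed(3)
# Hypothetical stand-in for the wine data.
wines_demo <- data.frame(
  alcohol          = runif(150, 8.4, 14.9),
  volatile.acidity = runif(150, 0.12, 1.58),
  quality          = sample(3:8, 150, replace = TRUE)
)

# One color per distinct quality score, from blue (low) to red (high).
grades    <- sort(unique(wines_demo$quality))
palette_q <- colorRampPalette(c("blue", "red"))(length(grades))
point_col <- palette_q[match(wines_demo$quality, grades)]

pdf(NULL)  # null device so the sketch runs non-interactively
plot(wines_demo$alcohol, wines_demo$volatile.acidity,
     col = point_col, pch = 19,
     xlab = "alcohol", ylab = "volatile.acidity",
     main = "Alcohol x Volatile Acidity by Quality color")
legend("topright", legend = grades, col = palette_q, pch = 19,
       title = "quality")
invisible(dev.off())
```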
Multivariate analysis of the variables of interest that were selected after the calculation of the correlation. In this plot we see alcohol X volatile.acidity X quality:
##
## Pearson's product-moment correlation
##
## data: wines$volatile.acidity and wines$alcohol
## t = -8.2546, df = 1597, p-value = 3.155e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2488416 -0.1548020
## sample estimates:
## cor
## -0.202288
Multivariate analysis of the variables of interest that were selected after the calculation of the correlation. In this plot we see alcohol X sulphates X quality:
##
## Pearson's product-moment correlation
##
## data: wines$sulphates and wines$alcohol
## t = 3.7568, df = 1597, p-value = 0.0001783
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.04477906 0.14196454
## sample estimates:
## cor
## 0.09359475
Multivariate analysis of the variables of interest that were selected after the calculation of the correlation. In this plot we see alcohol X citric.acid X quality:
##
## Pearson's product-moment correlation
##
## data: wines$citric.acid and wines$alcohol
## t = 4.4188, df = 1597, p-value = 1.059e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.06121189 0.15807276
## sample estimates:
## cor
## 0.1099032
Multivariate analysis of the variables of interest that were selected after the calculation of the correlation. In this plot we see volatile.acidity X sulphates X quality:
##
## Pearson's product-moment correlation
##
## data: wines$sulphates and wines$volatile.acidity
## t = -10.804, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3060917 -0.2147125
## sample estimates:
## cor
## -0.2609867
Multivariate analysis of the variables of interest that were selected after the calculation of the correlation. In this plot we see volatile.acidity X citric.acid X quality:
##
## Pearson's product-moment correlation
##
## data: wines$citric.acid and wines$volatile.acidity
## t = -26.489, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5856550 -0.5174902
## sample estimates:
## cor
## -0.5524957
Multivariate analysis of the variables of interest that were selected after the calculation of the correlation. In this plot we see sulphates X citric.acid X quality:
##
## Pearson's product-moment correlation
##
## data: wines$citric.acid and wines$sulphates
## t = 13.159, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2678558 0.3563278
## sample estimates:
## cor
## 0.31277
The results of the correlation between the analysed variables are:
| Correlation | alcohol | volatile.acidity | sulphates | citric.acid |
|---|---|---|---|---|
| alcohol | - | -0.2023 | 0.0936 | 0.1099 |
| volatile.acidity | -0.2023 | - | -0.2610 | -0.5525 |
| sulphates | 0.0936 | -0.2610 | - | 0.3128 |
| citric.acid | 0.1099 | -0.5525 | 0.3128 | - |
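The same table can be held as a symmetric matrix, which makes it easy to pick out the strongest pair programmatically. The sketch below fills the matrix with values rounded from the correlation outputs in this report. With the real data frame, `cor()` applied to the four columns would give the same kind of matrix directly. The names `pair_values` and `cor_mat` are chosen for illustration.

```r
vars <- c("alcohol", "volatile.acidity", "sulphates", "citric.acid")
cor_mat <- matrix(NA_real_, 4, 4, dimnames = list(vars, vars))
diag(cor_mat) <- 1

# Pairwise values rounded from the correlation outputs in this report.
pair_values <- list(
  list("alcohol",          "volatile.acidity", -0.2023),
  list("alcohol",          "sulphates",         0.0936),
  list("alcohol",          "citric.acid",       0.1099),
  list("volatile.acidity", "sulphates",        -0.2610),
  list("volatile.acidity", "citric.acid",      -0.5525),
  list("sulphates",        "citric.acid",       0.3128)
)

# Fill both triangles so the matrix is symmetric.
for (p in pair_values) {
  cor_mat[p[[1]], p[[2]]] <- p[[3]]
  cor_mat[p[[2]], p[[1]]] <- p[[3]]
}

# Strongest off-diagonal pair by absolute value.
off <- abs(cor_mat)
diag(off) <- 0
idx <- which(off == max(off), arr.ind = TRUE)[1, ]
strongest <- c(vars[idx[["row"]]], vars[idx[["col"]]])

print(cor_mat)
cat("strongest pair:", strongest, "\n")
```

The strongest pair is volatile.acidity and citric.acid, which matters when reading the plots: two of the four selected features carry partly overlapping information.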
Reading the six plots in order:
- alcohol x volatile.acidity: quality is higher for higher alcohol values and decreases as we move toward higher values of volatile.acidity. So both the wine quality and the alcohol value tend to decrease as volatile.acidity increases;
- alcohol x sulphates: sulphates seem to be lower for wines with less alcohol, and since the strongest correlation with wine quality was alcohol, in general we have lower quality for wines with a lower concentration of sulphates;
- alcohol x citric.acid: quality appears related to the increase of citric.acid. But to me, it’s only a weak relation, while it’s clear that the increase in quality is more related to the alcohol concentration;
- volatile.acidity x sulphates: with lower volatile.acidity we get an increase in the wine quality. Also, when the concentration of sulphates increases, in general, the volatile.acidity decreases;
- volatile.acidity x citric.acid: for higher citric.acid, we have a low concentration of volatile.acidity and a better wine quality. This is also shown in the correlation value that was obtained between citric.acid and volatile.acidity, -0.5525, which is a medium to strong negative correlation;
- sulphates x citric.acid: since the relation with quality isn’t that strong, I couldn’t see a clear combination that leads to better wine quality.

This plot analyses the variable Volatile Acidity against the wine’s quality. This plot was chosen because it shows the correlation of the two variables and the scattered means of this relation.
This plot showed the negative correlation between volatile.acidity and the wine quality. This correlation was also obtained in a previous section of this work.
In this plot we analyse the variable Alcohol against the wine’s quality. This plot is very important because it shows the relation between the most important feature in the dataset and the quality variable.
This plot showed the correlation between the feature alcohol and the wine quality, which is the largest correlation value that was obtained. From the analysis of this plot it was possible to see that this was the strongest relation.
Multivariate analysis of the variables of interest that were selected after the calculation of the correlation. In this plot we see volatile.acidity X citric.acid X quality:
This was the most unexpected result that I got in the analysis. It turns out that the lower the value of volatile.acidity, the higher the quality and the citric.acid value.
The hardest part of this analysis was that there’s no clear relation between one or two features and the wine quality, which makes it hard to answer the question that was asked at the beginning of this project:
Which chemical properties influence the quality of red wines?
Even though this was a hard question to answer, it was possible to see that the feature most related to wine quality was alcohol. It has the largest correlation value and the plots showed that, in general, the higher the alcohol value, the better the wine is considered.
Also, it’s important to say that the variables volatile.acidity, sulphates and citric.acid have some degree of influence on the wine quality.
These 4 features never were in my thoughts as the most relevant to the wine quality. I thought pH and residual sugar would be the ones that actually had an effect on the perception of the wine’s quality. To me this showed how important Exploratory Data Analysis is for a clear understanding of the information, and that guesses can be completely different from what the data actually contains.
Personally, this project was a great challenge to me. I had never used R before this class, but I got the hang of it. The challenge was to learn how to use this language to extract meaningful information from data and present it to people who aren’t familiar with computing. It was hard to decide which plot would be better in each case, but I’m satisfied with my work and more confident that I can actually work with it.
For future work, I’m thinking of starting to work with financial data about the stock market. It’s a topic that catches my attention and one where I think Data Science can be a really powerful tool.